helix-agent

Cut Claude Code token usage 82-97% with local LLMs.

v0.15.1 | Python 3.12+ | License: MIT | MCP | Works on 8GB VRAM

The Problem

Claude Code's Max plan quota can vanish in 19 minutes. A single screenshot costs ~15,000 tokens; one DOM snapshot costs ~114,000. Retry loops keep burning tokens until the quota is gone, and Claude Code has no built-in detection for them -- the #1 pain point (666+ upvotes).

The Solution

helix-agent is an MCP server that compresses screenshots, DOM, and browser output through your local GPU before Claude sees them -- and detects retry loops before they drain your quota. Connect it to Claude Code and savings happen automatically; no workflow changes needed.

Measured Results

| What | Without | With helix-agent | Reduction |
|---|---|---|---|
| Screenshot analysis | ~15,000 tokens | ~400 tokens | 97% |
| DOM/HTML processing | ~114,000 tokens | ~500 tokens | 99% |
| Browser automation | ~15,000 tokens/action | ~1,000-2,700 tokens/action | 82-93% |
| Retry loops | Infinite (until quota dies) | Stopped at 3rd repeat | 100% |
| Routine tasks | Opus tokens ($$$) | Local LLM ($0) | 100% |

All compression runs on your local GPU via Ollama. Zero cloud API cost.
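
The Reduction column follows directly from the before/after token counts. A minimal sketch of that arithmetic, using the approximate figures from the table above; the helper name `reduction_pct` is illustrative and not part of helix-agent:

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens saved when `before` tokens shrink to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return round(100 * (1 - after / before), 1)


# Approximate token counts taken from the Measured Results table.
rows = {
    "screenshot": (15_000, 400),
    "dom": (114_000, 500),
    "browser_best": (15_000, 1_000),
    "browser_worst": (15_000, 2_700),
}
reductions = {name: reduction_pct(b, a) for name, (b, a) in rows.items()}
print(reductions)
```

The two browser rows bracket the 82-93% range: 2,700 tokens per action is the low end and 1,000 tokens per action is the high end.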

Before / After

| | Without helix-agent | With helix-agent |
|---|---|---|
| Screenshot | 15,000 tokens raw image | 400 tokens structured text |
| DOM snapshot | 114,000 tokens raw HTML | 500 tokens action summary |
| Retry loop | Runs until quota dies | Stopped at 3rd repeat |
| Routine task | Opus ($$$) | Local Ollama ($0) |
| Cloud API cost | $50-200/month in waste | $0 |

Quick Start

git clone https://github.com/tsunamayo7/helix-agent.git
cd helix-agent && uv sync
ollama pull gemma4:e2b          # 8GB GPU (or e4b/26b/31b for larger)
uv run python server.py

Add to ~/.claude/settings.json:

{
  "mcpServers": {
    "helix-agent": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/helix-agent", "python", "server.py"]
    }
  }
}

Restart Claude Code. Done.

$0 cloud cost. All compression, retry detection, and delegation run on your own machine: the LLM work goes through Ollama on your local GPU, and retry detection is pure logic. No API keys, no subscriptions, no metered billing. Your tokens stay on your machine.

How It Works

Claude Code (Opus)
    |
    +-- helix-agent (MCP server)
           |
           +-- vision_compress ---- Local LLM ----> ~400 tokens  (was 15,000)
           +-- dom_compress ------- Local LLM ----> ~500 tokens  (was 114,000)
           +-- retry_guard -------- Pure logic ----> Loop stopped (sub-ms)
           +-- think / agent_task - Local LLM ----> $0 reasoning
           +-- computer_use ------- agent-browser -> 82-93% saved
           +-- code_review -------- 4-layer LLM --> $0.20 total
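
The retry_guard branch needs no model: spotting a loop only requires comparing each tool call with the one before it. A minimal sketch, assuming a guard that fingerprints the tool name plus its arguments and stops the third consecutive identical call. The class `RetryGuard`, its `check` and `reset` methods, and the hashing scheme are hypothetical illustrations, not helix-agent's actual implementation:

```python
import hashlib
import json


class RetryGuard:
    """Blocks a tool call once it has been repeated `limit` times in a row."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self._last: str | None = None
        self._streak = 0

    @staticmethod
    def _fingerprint(tool: str, args: dict) -> str:
        # sort_keys makes the hash independent of argument order.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def check(self, tool: str, args: dict) -> bool:
        """Return True if the call may proceed, False if it should be stopped."""
        fp = self._fingerprint(tool, args)
        self._streak = self._streak + 1 if fp == self._last else 1
        self._last = fp
        return self._streak < self.limit

    def reset(self) -> None:
        """Forget the previous call so counting starts over."""
        self._last, self._streak = None, 0


guard = RetryGuard()
call = ("browse", {"url": "https://example.com"})
results = [guard.check(*call) for _ in range(3)]
print(results)  # [True, True, False]: the third identical call is stopped
```

Because the check is one hash and one comparison, it stays far below the cost of any LLM call, which is in line with the "sub-ms" figure in the diagram.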

Works Everywhere

| Platform | GPU | Status |
|---|---|---|
| macOS (Apple Silicon) | Metal / M1-M4 | Tested daily |
| Linux | NVIDIA CUDA | Primary dev environment |
| Windows (WSL2) | NVIDIA CUDA | Supported via Ollama |
| Windows (native) | NVIDIA CUDA | Supported via Ollama |
| CPU-only | None | Works (slower, ~30s per compress) |

Anywhere Ollama runs, helix-agent runs. 8GB VRAM minimum for GPU acceleration.

Features

  • Vision Compress -- Screenshot to structured text via local vision LLM. 15,000 tokens to 400.
  • DOM Compress -- HTML/DOM to structured extract via local LLM. 114,000 tokens to 500.
  • Retry Guard -- Detects identical tool calls before they loop. Sub-millisecond, no LLM needed.
  • GPU Auto-Detection -- Detects your GPU at startup, selects the optimal model from 8GB to 96GB+.
  • Browser Automation -- Routes through agent-browser (Rust/CDP) with Playwright fallback. Native keyboard events fix React controlled components.
  • 4-Layer Code Review -- A gemma4 + Sonnet + Opus + Codex pipeline that combines findings from four reviewers at ~$0.20.
  • Self-Evolving Memory -- Reviews conversations every 5 turns, saves reusable skills as SKILL.md files. Gets smarter over time, all local.
  • Parallel Tasks -- Run multiple tasks simultaneously with 2-axis model routing (task type x input size).
  • ReAct Agents -- Local LLM delegation with tool access, sub-agents, background workers, and JSONL tracing.

Security: PathGuard

MCP tools that delegate to local LLMs can be tricked into accessing sensitive files. PathGuard prevents this with strict path allowlists -- delegated tools can only read/write directories you explicitly permit.

Defends against CVE-2025-59536 (RCE and API token exfiltration through Claude Code project files).

# PathGuard blocks unauthorized access automatically
HELIX_ALLOWED_PATHS=/home/user/projects,/tmp
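
An allowlist check of this kind can be sketched in a few lines. The sketch below assumes `HELIX_ALLOWED_PATHS` is a comma-separated list, as in the example above; the helper names `load_allowed_roots` and `is_allowed` are hypothetical and do not describe PathGuard's real internals:

```python
import os
from pathlib import Path


def load_allowed_roots(env_value: str) -> list[Path]:
    """Parse a comma-separated allowlist into resolved absolute paths."""
    return [Path(p.strip()).resolve() for p in env_value.split(",") if p.strip()]


def is_allowed(candidate: str, roots: list[Path]) -> bool:
    """True only if `candidate` resolves to a location inside an allowed root."""
    # resolve() collapses ".." segments and follows symlinks, so a path such
    # as /tmp/../etc/passwd is checked against where it really points.
    resolved = Path(candidate).resolve()
    return any(resolved.is_relative_to(root) for root in roots)


roots = load_allowed_roots(
    os.environ.get("HELIX_ALLOWED_PATHS", "/home/user/projects,/tmp")
)
print(is_allowed("/tmp/report.txt", roots))
print(is_allowed("/tmp/../etc/passwd", roots))
```

Resolving before comparing is the important design choice: a plain string-prefix test would accept `/tmp/../etc/passwd` because it starts with `/tmp`.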

Real-World Usage

helix-agent runs in production daily on the author's own Claude Code workflow:

  • 367 tests passing (pytest, all Ollama calls mocked)
  • 17+ hour autonomous sessions with retry guard preventing quota drain
  • 27 MCP tools + 3 Resources + 3 Prompts -- full MCP spec coverage
  • Used to build helix-pilot, helix-codex, and itself (dogfooding)

GPU Auto-Detection

helix-agent auto-selects the best model for your hardware:

| Your GPU | VRAM | Model | Compress Speed |
|---|---|---|---|
| RTX 4060 | 8GB | gemma4:e2b | 10.2s |
| RTX 4070 Ti | 16GB | gemma4:e4b | 11.8s |
| RTX 4090 / 3090 | 24GB | gemma4:26b | 14.7s |
| RTX PRO 6000 | 48GB+ | gemma4:31b | 27.5s |

gemma4:e2b on 8GB runs 2.7x faster than 31b with comparable compression quality. No expensive GPU required.
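
The selection step can be pictured as a lookup over VRAM tiers. A minimal sketch using the tiers from the table above; how the VRAM figure is detected is out of scope here, and `MODEL_TIERS`, `select_model`, and the below-8GB fallback are assumptions for illustration rather than helix-agent's actual logic:

```python
# (minimum VRAM in GB, model tag), largest tier first; values from the table above.
MODEL_TIERS = [
    (48, "gemma4:31b"),
    (24, "gemma4:26b"),
    (16, "gemma4:e4b"),
    (8, "gemma4:e2b"),
]


def select_model(vram_gb: float) -> str:
    """Pick the largest model whose VRAM tier fits the detected memory."""
    for min_vram, model in MODEL_TIERS:
        if vram_gb >= min_vram:
            return model
    # Assumption: below 8GB (or CPU-only) fall back to the smallest model.
    return MODEL_TIERS[-1][1]


print(select_model(12))  # gemma4:e2b: 12GB does not reach the 16GB tier
```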

Vision Pipeline

+--------------+     +-----------------+     +--------------+
| Screenshot   |---->| vision_compress |---->| ~400 tokens  |
| (15K tokens) |     | (local gemma4)  |     | (text only)  |
+--------------+     +-----------------+     +--------------+

+--------------+     +-----------------+     +--------------+
| DOM / HTML   |---->| dom_compress    |---->| ~500 tokens  |
| (114K tokens)|     | (local gemma4)  |     | (text only)  |
+--------------+     +-----------------+     +--------------+

Real measurement (RTX PRO 6000):

Input:  1920x1048 screenshot of X.com (~15,000 tokens)
Output: "X home feed, Japanese UI, 'For You' tab active..." (~400 tokens)
Saved:  7,362 tokens in one call

4-Layer Code Review

Automated multi-LLM review at ~$0.20 total:

| Layer | Reviewer | Findings | Cost |
|---|---|---|---|
| 1 | gemma4 + RAG (local) | 7 | $0 |
| 2 | Sonnet 4.7 | 14 | ~$0.13 |
| 3 | Opus 4.7 (summary only) | 16 | ~$0.03 |
| 4 | Codex (P1 only, on-demand) | 5 | ~$0.33 |
| Combined | | 16+ | ~$0.20 |

In this review, gemma4 + RAG ($0) produced more findings (7) than Codex GPT-5.3 (5, at ~$0.33).
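
One reading of the Cost column, offered as an assumption rather than the author's stated accounting: the ~$0.20 figure roughly matches the three layers that always run, with the on-demand Codex layer billed only when it is invoked. The arithmetic, using the approximate per-layer costs from the table:

```python
# Approximate per-layer costs in USD, from the 4-Layer Code Review table.
layer_costs = {
    "gemma4 + RAG": 0.00,
    "Sonnet": 0.13,
    "Opus (summary only)": 0.03,
    "Codex (on-demand)": 0.33,
}

# Sum the layers that run on every review, then the total including Codex.
always_on = sum(cost for name, cost in layer_costs.items() if "on-demand" not in name)
with_codex = sum(layer_costs.values())
print(f"layers 1-3: ${always_on:.2f}, with Codex: ${with_codex:.2f}")
```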

What Nothing Else Does

| Capability | helix-agent | Alternatives |
|---|---|---|
| Screenshot to text (97% cut) | Local vision LLM | No MCP server does this |
| DOM to text (99% cut) | Local LLM | Playwright MCP sends raw DOM |
| Retry loop detection | Sub-ms, no LLM | No built-in Claude Code detection |
| GPU auto-detect + model select | 8GB to 96GB+ | Manual config required |
| Self-evolving memory | SKILL.md + Qdrant | Unique to helix-agent |
| All 3 MCP primitives | 27 Tools + 3 Resources + 3 Prompts | Most MCPs implement Tools only |

MCP Architecture

27 tools organized by function:

| Category | Tools |
|---|---|
| Token saving | vision_compress, dom_compress |
| Loop prevention | retry_guard_check, retry_guard_status, retry_guard_reset |
| Local delegation | think, agent_task, fork_task, parallel_tasks |
| Vision & browser | see, browse, computer_use |
| Background agents | spawn_agent, send_agent_input, wait_agent, list_agents, close_agent |
| Memory | evolving_memory_review, list_learned_skills, get_skill, dept_search, dept_store |
| Code quality | code_review |
| Meta | providers, models, config, agent_types |

Plus 3 Resources (helix://status, helix://models, helix://config) and 3 Prompts (retry_report, optimize_tokens, setup_guide).

Configuration

helix-agent works with zero configuration. For advanced setups:

# Environment variables (all optional)
OLLAMA_HOST=http://localhost:11434   # Ollama endpoint
HELIX_PROVIDER=ollama               # LLM provider
HELIX_LOG_LEVEL=INFO                # Logging level

Optional dependencies:

  • Qdrant -- shared memory across sessions
  • Playwright -- browser automation fallback
  • agent-browser -- recommended for 82-93% browser token savings

Requirements

  • Python 3.12+
  • uv
  • Ollama + any Gemma 4 model:

| GPU VRAM | Command | Model Size |
|---|---|---|
| 8GB | ollama pull gemma4:e2b | 4GB |
| 16GB | ollama pull gemma4:e4b | 6GB |
| 24GB | ollama pull gemma4:26b | 12GB |
| 48GB+ | ollama pull gemma4:31b | 20GB |

Related Projects

Not a Claude Code Wrapper

helix-agent is an MCP server that Claude Code connects to. It does not wrap, proxy, or re-host Claude Code or the Anthropic API. Fully compliant with Anthropic's Terms of Service.

Contributing

See CONTRIBUTING.md.

License

MIT
